fix(compute): explicit kaniko resources + imagePullPolicy Always#2
Merged
Merged
Conversation
Two related fixes that an agent hits when redeploying via /deploy/new: 1. Kaniko build pod was sized by the per-namespace LimitRange (hobby tier default: 50m CPU / 256Mi RAM). Non-trivial npm install hung 5+ minutes. Now sets explicit Requests=250m/256Mi and Limits=1/512Mi on the kaniko container. Build runs as a Job, so the higher allocation does not inflate the app's permanent quota. 2. App container had imagePullPolicy=IfNotPresent on the :latest tag. After a redeploy pushed a new image to GHCR, k8s served the cached old layer because the tag string was unchanged. Redeploys silently succeeded while pods kept the prior code. Switched to PullAlways. Also makes the K8sProvider.clientset field a kubernetes.Interface so tests can inject fake.NewSimpleClientset. Tests cover the resource floor + PullAlways guarantee. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
mastermanas805
added a commit
that referenced
this pull request
May 11, 2026
…s package (#12) User-facing pain that drove this: every domain rename or proxy hostname tweak required scraping the codebase with grep + sed. The instant.dev → instanode.dev rename earlier today touched 28 files and still missed some (we're still finding stragglers). The "Use named constants, not inline strings" memo applies — extracting was overdue. New package internal/urls/urls.go centralises: Public hostnames: PublicAPIBase = "https://api.instanode.dev" PublicMarketingBase = "https://instanode.dev" StartURLPrefix = PublicMarketingBase + "/start" DeploymentWildcard = "deployment.instanode.dev" StoragePublicHost = "s3.instanode.dev" Cluster-internal proxy FQDNs (for in-cluster workloads — friction PR #2): InternalPGProxy = "instant-pg-proxy.instant.svc.cluster.local:5432" InternalRedisProxy = "instant-redis-proxy.instant.svc.cluster.local:6379" InternalMongoProxy = "instant-mongo-proxy.instant.svc.cluster.local:27017" InternalNATSProxy = "instant-nats-proxy.instant.svc.cluster.local:4222" InternalMinIO = "minio.instant-data.svc.cluster.local:9000" Helper: UpgradeStartURL(jwt) — canonical builder for /start?t=<jwt>, replaces 12 sites that had identical fmt.Sprintf calls. Call sites refactored: - 12 fmt.Sprintf upgrade-URL calls across 6 provisioning handlers → urls.UpgradeStartURL(jwtToken) - proxiedInternalURL in handlers/internal_url.go → uses 4 urls.Internal* - upgradeNote / limitExceededNote bare links → urls.StartURLPrefix - canonicalAPIBase (handlers/auth.go) → alias of urls.PublicAPIBase - defaultCanonicalResourceURL (middleware/auth.go) → alias of urls.PublicAPIBase (eliminates the long-standing duplication of the same string in two different files) - 15 user-facing error strings ("Sign up at https://instanode.dev/start") → string concat with urls.StartURLPrefix Audit results: Before: 28+ occurrences of "instanode.dev/start" in non-test handler code After: 1 occurrence (an inline example URL inside the OpenAPI JSON description string — appropriate to leave as a literal since it's sample documentation, not a runtime-constructed URL) Before: 4 proxy FQDN string literals in handlers/internal_url.go After: 0 (all 4 use urls.* constants) Tests: internal/urls/urls_test.go (12 sub-tests): TestPublicHostnames_MatchExpectedShape — guards 5 public hosts TestInternalProxyHostnames_CorrectPortsAndService — guards 5 internal TestUpgradeStartURL_Composition — builder semantics Existing tests still pass: TestProxiedInternalURL — 7 sub-cases unchanged after refactor TestUpgradeNote_DoesNotMentionTrial / TestLimitExceededNote_… — 6 cases TestWhoami_*, TestDeployNew_EnvVars_*, all TestOpenAPI_* — pass Live verification on v2.1.0-urls-package in prod: POST /db/new (fresh) → upgrade host = "instanode.dev" internal_url host = "instant-pg-proxy.instant.svc.cluster.local:5432" note starts with "Works for 24h free" GET /api/v1/whoami → 200 with team_id + plan_tier + team_name (proves the middleware audience lookup still works via the urls.PublicAPIBase alias) Out of scope (separate follow-ups): - Email templates in internal/email/email.go (45 occurrences) — these are marketing copy strings with their own care; need different extraction approach (probably a templated config). - SDK references (sdk-go, sdk-node, mcp) — each is its own repo; consolidating across modules needs a shared common/urls package and cross-repo coordination. - Worker / provisioner repos — they have far fewer URL string instances and can do their own thing. - The OpenAPI bearerAuth description's inline example URL — sample documentation, not a runtime URL. Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
mastermanas805
added a commit
that referenced
this pull request
May 16, 2026
…tract bugs
The api Deploy workflow had been red on every run. Root causes, all fixed:
Schema-mirror drift (the bulk of the ~25 failures)
testhelpers.runMigrations hand-mirrors the prod schema; CI ran against a
bare Postgres so any migration that added a table/column without a mirror
edit silently broke the gate. Completed the mirror (email_events,
pending_deletions, deployment_events, api_keys, magic_links, custom_domains,
service_components, uptime_samples + 9 drifted columns) and added
TestRunMigrationsMirrorsEveryMigrationTable — a guard that enumerates every
CREATE TABLE in the migration files and fails if one is unmirrored.
deploy.yml now also applies the real migration files before testing, so CI
runs the same schema developers do (kills column drift too).
Genuine product/test bugs the lax mirror had been masking:
- rate_limit.go / idempotency.go: a nil *redis.Client SIGSEGV'd the whole
API on the first request. Both now fail open (CLAUDE.md convention #1).
- cache/redis.go provisionLocal/StorageBytes: nil Redis client now returns
a clean error → handler 503, never a panic (convention #2).
- models.AcceptInvitation: did not clear is_primary when moving an invitee
into a team, violating uq_users_one_primary_per_team → 500. Now cleared.
- resource_elevate_test.go inserted stack/deployment rows missing the real
NOT NULL columns (namespace, app_id); now valid.
- Fixed 5 handler tests with stale tier/seat/mock assumptions and dropped
the stale -skip list (TestOpenAPI/TestCrossTeam/TestCustomDomainCreate
all pass now).
go test ./... -short: 25 packages, 0 FAIL, 0 panic. build + vet clean.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
mastermanas805
added a commit
that referenced
this pull request
May 20, 2026
…t no-op Two contract-drift P0s flagged in the bug-hunt brief, both shipped in the same PR because they share the test gate. B13-P0-F1: POST /auth/cli auth_url leaked dead-brand * Old: frontendURL() returned "https://instant.dev" in production — instant.dev is the dead-brand marketing host (404s). An agent following the auth_url landed on a parking page and gave up. * Fix: read cfg.DashboardBaseURL (DASHBOARD_BASE_URL env var); fall back to "https://instanode.dev" when DASHBOARD_BASE_URL is unset in production. internal/handlers/cli_auth.go. * Coverage: TestAuth_CLI_ReturnsInstanodeDomain (4 cases — explicit base, empty-fallback, dev default, trailing-slash strip) + TestAuth_CLI_NoLegacyInstantDevInResponses (rule-17 source-scan that fails if a new handler emits instant.dev/<x> anywhere). * Also cleaned: internal/handlers/logs.go (error message URL), internal/handlers/openapi.go (storage endpoint example). B7-P0-1: PATCH /stacks/:slug/env was a silent no-op * Old: handler logged stack.env.noted, returned 200, persisted nothing — the next /stacks/:slug/redeploy rebuilt with stale env. Silent-data-loss failure mode. * Fix (Option A from the brief): migration 062 adds stacks.env_vars JSONB DEFAULT '{}'::jsonb. Handler now loads existing env, merges (empty-string value deletes), validates every key against isValidEnvKey (POSIX [A-Z_][A-Z0-9_]*), persists, emits stack.env.updated audit_log row, and returns the full merged map. * 64KiB cap on serialized payload (ErrStackEnvVarsTooLarge → 413). * OpenAPI updated to describe PATCH semantics + new 409/413 responses + the merged-env response shape. * Coverage: TestStack_PatchEnv_PersistsAndReturns (7 subtests — auth, first patch, merge, delete-via-empty, invalid_env_key, missing_env, 404 unknown slug). All assertions verify the DB round-trip — pre-fix the handler returned 200 but persisted nothing, which is exactly what subtest #2 now fails on. Coverage block per CLAUDE.md rule 17: Symptom: instant.dev/<x> in handler-emitted strings Enumeration: rg -nF "instant.dev/" internal/handlers/ --type go Sites found: 3 (cli_auth.go:339, logs.go:136, openapi.go:2784) Sites touched: 3 — all three migrated to instanode.dev Coverage test: TestAuth_CLI_NoLegacyInstantDevInResponses scans every .go file under internal/handlers/ and fails if any non-test, non-comment, non-label, non-import-path mention of instant.dev/<x> appears. Live verified: pending — see follow-up commit after deploy Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Summary
Two friction-point fixes that an unaided AI agent hits when redeploying via `/deploy/new`:
1. Kaniko build pod was throttled to 50m CPU / 256Mi RAM
The per-namespace `LimitRange` (hobby default) was the only thing sizing the kaniko container — non-trivial `npm install` hung 5+ minutes. Sets explicit `Requests=250m/256Mi` and `Limits=1/512Mi` on the kaniko container. Build runs as a Job, so the bigger allocation doesn't inflate the app's permanent footprint.
2. `imagePullPolicy: IfNotPresent` + `:latest` tag → silent stale redeploys
When kaniko pushed a new image to GHCR under the same `:latest` tag, k8s served the cached old layer because the tag string was unchanged. Redeploys appeared to succeed while pods kept the prior code. Switched to `PullAlways` everywhere (single-app + stack).
3. `K8sProvider.clientset` is now `kubernetes.Interface`
Drop-in change — `*Clientset` already implements `Interface`. Enables tests to inject `fake.NewSimpleClientset`.
Test plan
🤖 Generated with Claude Code